Project Members: 1) Ajinkya Desai, 2) Akash Bharsakle, 3) Asawari Kadam, 4) Prachi Kotkar
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%autosave 0
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as sk
import sklearn.tree as tree
from IPython.display import Image
import pydotplus
from sklearn.cluster import KMeans
import warnings
# suppress all future warnings
warnings.filterwarnings('ignore')
Autosave disabled
df = pd.read_csv('US_Accidents_March23.csv',index_col=0, parse_dates=True)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.max_rows', 100)
pd.set_option('display.min_rows', 100)
This dataset contains information about car accidents across the United States, covering 49 states (all except Alaska) over the period from February 2016 to March 2023. It holds approximately 7.7 million accident records.
General Info
Weather Situations
Road Conditions
Period of the Day
df.shape
(7728394, 45)
df.columns.values
array(['Source', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat',
'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description',
'Street', 'City', 'County', 'State', 'Zipcode', 'Country',
'Timezone', 'Airport_Code', 'Weather_Timestamp', 'Temperature(F)',
'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)',
'Wind_Direction', 'Wind_Speed(mph)', 'Precipitation(in)',
'Weather_Condition', 'Amenity', 'Bump', 'Crossing', 'Give_Way',
'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',
'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop',
'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
'Astronomical_Twilight'], dtype=object)
df.reset_index(inplace=True)
df.head()
| ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | ... | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-1 | Source2 | 3 | 2016-02-08 05:46:00 | 2016-02-08 11:00:00 | 39.87 | -84.06 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Night | Night | Night |
| 1 | A-2 | Source2 | 2 | 2016-02-08 06:07:59 | 2016-02-08 06:37:59 | 39.93 | -82.83 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Night | Night | Day |
| 2 | A-3 | Source2 | 2 | 2016-02-08 06:49:27 | 2016-02-08 07:19:27 | 39.06 | -84.03 | NaN | NaN | 0.01 | ... | False | False | False | False | True | False | Night | Night | Day | Day |
| 3 | A-4 | Source2 | 3 | 2016-02-08 07:23:34 | 2016-02-08 07:53:34 | 39.75 | -84.21 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Day | Day | Day |
| 4 | A-5 | Source2 | 2 | 2016-02-08 07:39:07 | 2016-02-08 08:09:07 | 39.63 | -84.19 | NaN | NaN | 0.01 | ... | False | False | False | False | True | False | Day | Day | Day | Day |
5 rows × 46 columns
Replacing placeholder values ('?' and blank strings) with NaN
df.replace(to_replace=['?',' '],value=np.nan,inplace=True)
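The effect of `replace` can be verified on a toy frame (toy values, not drawn from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with the two placeholder values used above: '?' and a single space.
toy = pd.DataFrame({'City': ['Dayton', '?', ' '], 'State': ['OH', 'OH', '?']})
toy = toy.replace(to_replace=['?', ' '], value=np.nan)

# Both placeholders are now real NaNs, so isna() can count them.
print(toy.isna().sum().sum())  # 3
```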
Checking how many NaNs actually exist
df.isna().sum()
ID 0 Source 0 Severity 0 Start_Time 0 End_Time 0 Start_Lat 0 Start_Lng 0 End_Lat 3402762 End_Lng 3402762 Distance(mi) 0 Description 5 Street 10869 City 253 County 0 State 0 Zipcode 1915 Country 0 Timezone 7808 Airport_Code 22635 Weather_Timestamp 120228 Temperature(F) 163853 Wind_Chill(F) 1999019 Humidity(%) 174144 Pressure(in) 140679 Visibility(mi) 177098 Wind_Direction 175206 Wind_Speed(mph) 571233 Precipitation(in) 2203586 Weather_Condition 173459 Amenity 0 Bump 0 Crossing 0 Give_Way 0 Junction 0 No_Exit 0 Railway 0 Roundabout 0 Station 0 Stop 0 Traffic_Calming 0 Traffic_Signal 0 Turning_Loop 0 Sunrise_Sunset 23246 Civil_Twilight 23246 Nautical_Twilight 23246 Astronomical_Twilight 23246 dtype: int64
A few of the columns, such as 'Country' and 'Turning_Loop', contain only one unique value
categorical_col = ['Country','Timezone','Bump','Crossing','Junction','No_Exit','Railway','Roundabout','Station',\
'Stop','Traffic_Signal', 'Turning_Loop','Sunrise_Sunset']
for i in categorical_col:
print(i,df[i].unique().size)
Country 1 Timezone 5 Bump 2 Crossing 2 Junction 2 No_Exit 2 Railway 2 Roundabout 2 Station 2 Stop 2 Traffic_Signal 2 Turning_Loop 1 Sunrise_Sunset 3
Dropping the columns below: some ('ID', 'Description', 'Wind_Direction', 'End_Lat', 'End_Lng', 'Source', 'County', 'Airport_Code', 'Precipitation(in)') did not contribute much insight, 'Country' has only one unique value, and 'Weather_Timestamp' is nearly identical to 'Start_Time'.
df.drop(columns=['ID','Description', 'Wind_Direction', 'End_Lng', 'End_Lat', 'Weather_Timestamp',\
                 'Country', 'Source', 'County', 'Airport_Code','Precipitation(in)'],inplace=True)
Dropping duplicate rows from the dataset
df.drop_duplicates(inplace=True)
Renaming the columns for easier access in calculations (removing the unit brackets from the column names)
df.rename(columns={'Distance(mi)':'Distance','Temperature(F)':'Temperature','Wind_Chill(F)':'Wind_Chill',\
'Humidity(%)':'Humidity','Pressure(in)':'Pressure','Visibility(mi)':'Visibility',\
'Wind_Speed(mph)':'Wind_Speed'},inplace=True)
Dropping rows with NaNs in columns where the NaN counts are insignificant.
df.dropna(how='any', subset=['City', 'Street', 'Zipcode', 'Timezone', 'Civil_Twilight',
                             'Astronomical_Twilight', 'Nautical_Twilight'], inplace=True)
Filling NaNs with the mean or median of the respective column, since these columns contain a significant number of NaNs.
df['Temperature'].fillna(df['Temperature'].mean(), inplace=True)
df['Wind_Chill'].fillna(df['Wind_Chill'].mean(), inplace=True)
df['Humidity'].fillna(df['Humidity'].mean(), inplace=True)
df['Pressure'].fillna(df['Pressure'].mean(), inplace=True)
df['Visibility'].fillna(df['Visibility'].median(), inplace=True)
df['Wind_Speed'].fillna(df['Wind_Speed'].median(), inplace=True)
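Why mean for some columns and median for others: the median is robust to outliers, which suits skewed columns such as Wind_Speed (whose maximum in the summary table is 1087 mph, clearly an outlier). A toy illustration with hypothetical readings:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one extreme outlier and a missing value.
s = pd.Series([5.0, 7.0, 6.0, np.nan, 1000.0])

mean_filled = s.fillna(s.mean())      # pulled toward the outlier
median_filled = s.fillna(s.median())  # robust to the outlier

print(mean_filled[3], median_filled[3])  # 254.5 6.5
```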
Filling NaNs with the forward-fill method
df['Weather_Condition'] = df['Weather_Condition'].ffill()
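Forward fill carries the last observed value into subsequent gaps, a reasonable choice for weather conditions that change slowly over time (note that `fillna(method='ffill')` is deprecated in recent pandas in favor of `.ffill()`). A toy check:

```python
import numpy as np
import pandas as pd

# Hypothetical weather sequence with gaps; ffill propagates the last seen value.
w = pd.Series(['Clear', np.nan, np.nan, 'Rain', np.nan])
print(w.ffill().tolist())  # ['Clear', 'Clear', 'Clear', 'Rain', 'Rain']
```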
Formatting date-time
df.Start_Time = pd.to_datetime(df.Start_Time)
df.End_Time = pd.to_datetime(df.End_Time)
Checking the number of NaNs present after the data cleaning
df.isna().sum().sum()
0
Final Shape of the Data Frame
df.shape
(7566065, 35)
df.columns.values
array(['Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
'Distance', 'Street', 'City', 'State', 'Zipcode', 'Timezone',
'Temperature', 'Wind_Chill', 'Humidity', 'Pressure', 'Visibility',
'Wind_Speed', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing',
'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout',
'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight',
'Nautical_Twilight', 'Astronomical_Twilight'], dtype=object)
df.describe()
| Severity | Start_Lat | Start_Lng | Distance | Temperature | Wind_Chill | Humidity | Pressure | Visibility | Wind_Speed | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 7566065.00 | 7566065.00 | 7566065.00 | 7566065.00 | 7566065.00 | 7566065.00 | 7566065.00 | 7566065.00 | 7566065.00 | 7566065.00 |
| mean | 2.21 | 36.19 | -94.74 | 0.56 | 61.74 | 58.31 | 64.81 | 29.54 | 9.11 | 7.64 |
| std | 0.49 | 5.07 | 17.39 | 1.77 | 18.82 | 19.27 | 22.57 | 0.99 | 2.66 | 5.22 |
| min | 1.00 | 24.55 | -124.62 | 0.00 | -89.00 | -89.00 | 1.00 | 0.00 | 0.00 | 0.00 |
| 25% | 2.00 | 33.39 | -117.23 | 0.00 | 50.00 | 52.00 | 49.00 | 29.39 | 10.00 | 5.00 |
| 50% | 2.00 | 35.80 | -87.83 | 0.03 | 63.00 | 58.31 | 66.00 | 29.85 | 10.00 | 7.00 |
| 75% | 2.00 | 40.07 | -80.37 | 0.45 | 76.00 | 71.00 | 84.00 | 30.03 | 10.00 | 10.00 |
| max | 4.00 | 49.00 | -67.11 | 441.75 | 207.00 | 207.00 | 100.00 | 58.63 | 140.00 | 1087.00 |
The bar chart below gives a brief idea of the accident counts by month of the year
month_df = pd.DataFrame(df.Start_Time.dt.month.value_counts()).reset_index()
month_df.sort_values(by='index')
| index | Start_Time | |
|---|---|---|
| 1 | 1 | 740919 |
| 4 | 2 | 649718 |
| 10 | 3 | 545929 |
| 7 | 4 | 574685 |
| 9 | 5 | 548259 |
| 8 | 6 | 561513 |
| 11 | 7 | 506141 |
| 6 | 8 | 592922 |
| 5 | 9 | 638268 |
| 3 | 10 | 656248 |
| 2 | 11 | 733412 |
| 0 | 12 | 818051 |
month = month_df.rename(columns={'Start_Time':'count','index':'month'}).sort_values(by='month', ascending=True)
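A note of caution: the column produced by `value_counts().reset_index()` is named differently across pandas versions ('index' in older releases, the series name in pandas ≥ 2.0). Calling `rename_axis` first makes the column names predictable. A version-robust sketch on toy timestamps:

```python
import pandas as pd

# Toy timestamps spanning two months.
ts = pd.to_datetime(['2016-02-08', '2016-02-09', '2016-03-01'])
months = pd.Series(ts).dt.month.value_counts()

# rename_axis names the index before reset_index, so the resulting
# columns are 'month' and 'count' on any pandas version.
month = months.rename_axis('month').reset_index(name='count').sort_values('month')
print(month['month'].tolist())  # [2, 3]
```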
# Highlighting the months of interest (only the months with the minimum and maximum counts)
sns.catplot(x='month',y='count',data=month,kind='bar',\
palette = ["lightpink", "lightpink","lightpink","lightpink","lightpink","lightpink","blue",\
"lightpink", "lightpink","lightpink","lightpink","red"])
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8)) = plt.subplots(nrows=4, ncols=2, figsize = (16,20))
road_conditions = ['Bump', 'Crossing', 'Give_Way', 'Junction', 'Stop', 'No_Exit', 'Traffic_Signal', 'Turning_Loop']
colors = [('#6662b3', '#00FF00'), ('#7881ff', '#0e1ce8'), ('#18f2c7', '#09ad8c'), ('#08ff83', '#02a352'), ('#ffcf87', '#f5ab3d'),
('#f5f53d', '#949410'), ('#ff9187', '#ffc7c2'), ('tomato', '#008000')]
count = 0
def func(pct, allvals):
    # Convert a wedge's percentage back into an absolute case count.
    absolute = int(round(pct / 100 * np.sum(allvals)))
    return "{:.2f}%\n({:,d} Cases)".format(pct, absolute)
for i in [ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8]:
size = list(df[road_conditions[count]].value_counts())
if len(size) != 2:
size.append(0)
labels = ['False', 'True']
i.pie(size, labels = labels, colors = colors[count],
autopct = lambda pct: func(pct, size), labeldistance=1.1,
textprops={'fontsize': 12}, explode=[0, 0.2])
title = '\nPresence of {}'.format(road_conditions[count])
i.set_title(title, fontsize = 18, color='grey')
count += 1
[Pie charts: share of accidents occurring in the presence of each road feature — Bump 0.05% (3,446 cases), Crossing 11.35% (858,825), Give_Way 0.47% (35,716), Junction 7.35% (556,027), Stop 2.78% (210,081), No_Exit 0.25% (19,217), Traffic_Signal 14.84% (1,122,477), Turning_Loop 0.00% (0)]
df_zone = df.copy()
df_zone.columns
Index(['Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
'Distance', 'Street', 'City', 'State', 'Zipcode', 'Timezone',
'Temperature', 'Wind_Chill', 'Humidity', 'Pressure', 'Visibility',
'Wind_Speed', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing',
'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop',
'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
'Astronomical_Twilight'],
dtype='object')
Northwest =['OR','WA','ID']
Southwest =['CA','NV','AZ','UT']
North_Central = ['MT','WY', 'CO','ND', 'SD', 'NE', 'KS', 'MN', 'IA', 'MO']
South_Central = ['NM','TX','OK','LA']
Midwest =['WI', 'IL','IN', 'OH','MI']
Southeast = ['AR','TN', 'MS', 'AL', 'FL', 'GA', 'SC', 'NC']
Northeast = [ 'DC','KY', 'VA', 'WV', 'MD', 'DE', 'PA', 'NJ', 'NY', 'CT', 'RI', 'MA', 'NH', 'VT', 'ME']
df_zone['Zone'] = np.select(
[df_zone['State'].isin(Northwest), df_zone['State'].isin(Southwest), df_zone['State'].isin(North_Central), df_zone['State'].isin(South_Central),df_zone['State'].isin(Midwest),df_zone['State'].isin(Southeast),df_zone['State'].isin(Northeast)],
['Northwest', 'Southwest', 'North Central', 'South Central','Midwest','Southeast','Northeast'],
default='Unknown'
)
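`np.select` evaluates the conditions in order and falls back to `default` when none match, so any state outside all seven lists (e.g. 'AK', which is absent from this dataset) becomes 'Unknown'. A toy check using a subset of the lists above:

```python
import numpy as np
import pandas as pd

# Toy check of the zone assignment logic with a handful of states.
toy = pd.DataFrame({'State': ['CA', 'TX', 'NY', 'AK']})
toy['Zone'] = np.select(
    [toy['State'].isin(['CA', 'NV', 'AZ', 'UT']),
     toy['State'].isin(['NM', 'TX', 'OK', 'LA']),
     toy['State'].isin(['NY', 'PA', 'MA'])],
    ['Southwest', 'South Central', 'Northeast'],
    default='Unknown')
print(toy['Zone'].tolist())  # ['Southwest', 'South Central', 'Northeast', 'Unknown']
```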
plt.figure(figsize=(12, 8))
# Grouping by Zone and calculating the counts for each parameter
grouped_df = df_zone.groupby('Zone')[['Traffic_Signal', 'Crossing', 'Junction']].sum().reset_index()
# Melting the DataFrame for better visualization
melted_df = pd.melt(grouped_df, id_vars='Zone', var_name=' Road Parameters', value_name='Count')
# Plotting the bar graph
ax = sns.barplot(x='Zone', y='Count', hue=' Road Parameters', data=melted_df)
plt.title('Accidents Count by Zone and Road Parameters')
plt.xlabel('Zone')
plt.ylabel('Count')
# Calculating and displaying the percentages on each bar
for p in ax.patches:
height = p.get_height()
ax.text(p.get_x() + p.get_width() / 2., height, f'{height / grouped_df["Traffic_Signal"].sum() * 100:.2f}%',
ha="center", va="bottom")
plt.show()
[Grouped bar chart: accident counts by zone for Traffic_Signal, Crossing and Junction, annotated with percentages. Traffic_Signal: Midwest 9.36%, North Central 3.96%, Northeast 16.96%, Northwest 2.59%, South Central 18.32%, Southeast 31.27%, Southwest 17.55%. Crossing: Midwest 5.95%, North Central 3.09%, Northeast 11.30%, Northwest 3.42%, South Central 9.97%, Southeast 28.70%, Southwest 14.09%. Junction: Midwest 3.50%, North Central 3.73%, Northeast 11.62%, Northwest 1.36%, South Central 3.53%, Southeast 8.50%, Southwest 17.30%.]
## Decision Tree
df2 = df_zone.copy()
zone_mapping = {'Northwest':1, 'Southwest':2, 'North Central':3, 'South Central':4,'Midwest':5,'Southeast':6,\
'Northeast':7}
df2['Zone'] = df2['Zone'].map(zone_mapping)
df2.drop(columns=['Start_Time','Start_Lat','Start_Lng','Street','City','State','Zipcode',\
'Wind_Speed','Distance','Railway','No_Exit','Timezone','Temperature','Visibility',\
'Weather_Condition','Civil_Twilight','Nautical_Twilight', 'Astronomical_Twilight','End_Time',\
'Wind_Chill','Humidity','Pressure','Turning_Loop'],inplace= True)
df2.columns
Index(['Severity', 'Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction',
'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
'Sunrise_Sunset', 'Zone'],
dtype='object')
# Convert the boolean road-feature flags to floats for the classifier.
bool_cols = ['Bump', 'Crossing', 'Give_Way', 'Junction', 'Roundabout', 'Station',
             'Stop', 'Traffic_Signal', 'Traffic_Calming']
for col in bool_cols:
    df2[col] = df2[col].astype(float)
df2['Sunrise_Sunset'] = df2['Sunrise_Sunset'].map({'Day': 1, 'Night': 0})
df2['Sunrise_Sunset'].fillna(df2['Sunrise_Sunset'].mean(), inplace=True)
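`Series.map` returns NaN for any label missing from the mapping dictionary, which is why the `fillna` follows. A toy check (the 'Dusk' label is hypothetical, not from the dataset):

```python
import pandas as pd

s = pd.Series(['Day', 'Night', 'Dusk'])  # 'Dusk' is not in the mapping
mapped = s.map({'Day': 1, 'Night': 0})
print(mapped.tolist())  # [1.0, 0.0, nan]

# Unmapped labels become NaN; filling with the mean keeps the column numeric.
print(mapped.fillna(mapped.mean()).tolist())  # [1.0, 0.0, 0.5]
```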
dt= tree.DecisionTreeClassifier(max_depth=1)
X = df2.drop('Zone',axis=1)
Y = df2.Zone
dt.fit(X,Y)
DecisionTreeClassifier(max_depth=1)
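The tree above is fit on the full data. As a sanity check of the same modelling pattern, a depth-1 tree on synthetic data (synthetic features and labels, not the accident dataset) recovers a rule that is separable with a single split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 5)).astype(float)
y = X[:, 0].astype(int)  # label fully determined by the first feature

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # 1.0 — one split on feature 0 suffices here
```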
dt_feature_names = list(X.columns)
dt_target_names = [str(s) for s in Y.unique()]
tree.export_graphviz(dt, out_file='tree.dot',
feature_names=dt_feature_names, class_names=dt_target_names,
filled=True)
graph = pydotplus.graph_from_dot_file('tree.dot')
Image(graph.create_png())
# Selecting rows for Zone 7 and Zone 4 with Traffic_Signal equal to 1 for validation of the decision tree
selected_zones_traffic_signal_1 = df2[(df2['Zone'].isin([7, 4])) & (df2['Traffic_Signal'] == 1)]
# Plotting
plt.figure(figsize=(10, 6))
sns.countplot(data=selected_zones_traffic_signal_1, x='Zone')
plt.title('Count of Accidents with Traffic Signal = 1 for Zone 7 and Zone 4')
plt.xlabel('Zone')
plt.ylabel('Count')
plt.show()
[Bar chart: count of accidents with Traffic_Signal = 1 for Zone 7 (Northeast) and Zone 4 (South Central)]
Insight
Remarkably, almost 34% of accidents occurred even though a traffic signal, crossing, or junction was present at the location: (Traffic Signal) 14.84 + (Crossing) 11.35 + (Junction) 7.35 = 33.54%.
On detailed analysis of these factors, dividing the US states into 7 zones based on geography, the graph makes it evident that 68.47% of accidents in the Southeast region occurred even though these three features (traffic signal, crossing, junction) were present: (Traffic Signal) 31.27 + (Crossing) 28.70 + (Junction) 8.50 = 68.47%.
Recommendation:
The government should levy heavier penalties for traffic-rule violations (especially in the Southeast region) so that people drive more cautiously and the accident count falls.
severity_counts = df['Severity'].value_counts()
severity_counts
# Most accidents are of severity '2' and '3',
# with very few cases of severity '1' and '4'.
2 6011746 3 1291377 4 197935 1 65007 Name: Severity, dtype: int64
pie_chart = df.groupby('Severity')['Severity'].count().\
plot(kind='pie',figsize=(6, 6),autopct='%1.0f%%',cmap="Blues")
labels = severity_counts.index.tolist()
# Adding a title to the pie chart
pie_chart.set_title('Distribution of Accident Severity', fontsize=15)
Text(0.5, 1.0, 'Distribution of Accident Severity')
# 5. Traffic Management and Infrastructure Improvement
# Analyzing accidents based on city or county
citywise_accidents = df['City'].value_counts().head(10) # Top 10 cities
citywise_accidents
# Citywise distribution plot
sns.barplot(y=citywise_accidents.index, x=citywise_accidents.values)
plt.title('Top 10 Cities with Most Accidents')
plt.xlabel('Number of Accidents')
plt.ylabel('City')
# plt.text(1, max(citywise_accidents.values) * 1, 'Miami and LA identify areas with high accident frequencies to improve road safety measures', fontsize=6, color='black')
plt.show()
Miami 183485 Houston 168242 Los Angeles 154732 Charlotte 136731 Dallas 129743 Orlando 108517 Austin 96411 Raleigh 85057 Nashville 72210 Baton Rouge 70682 Name: City, dtype: int64
<Axes: >
Text(0.5, 1.0, 'Top 10 Cities with Most Accidents')
Text(0.5, 0, 'Number of Accidents')
Text(0, 0.5, 'City')
Hotspot Analysis: major cities such as Miami, Houston, and Los Angeles stand out as areas with high accident frequencies
# Group by 'City' and count the number of accidents
city_accidents = df.groupby('City').size().reset_index(name='Accident_Count')
# 'Start_Lat' and 'Start_Lng' give the coordinates for the accidents
cities_df = df[['City', 'Start_Lat', 'Start_Lng']].drop_duplicates(subset=['City'])
# Merge the city locations with the accident counts
plotting_df = pd.merge(cities_df, city_accidents, on='City')
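Note that `drop_duplicates(subset=['City'])` keeps whichever accident record happens to appear first for each city, so the plotted point depends on row order. Averaging the coordinates per city may be more representative. A sketch on toy data (hypothetical coordinates):

```python
import pandas as pd

# Toy accidents: two records for the same city with slightly different coordinates.
toy = pd.DataFrame({'City': ['Miami', 'Miami', 'Houston'],
                    'Start_Lat': [25.76, 25.80, 29.76],
                    'Start_Lng': [-80.19, -80.21, -95.37]})

# Average the coordinates per city instead of taking the first record seen.
city_loc = toy.groupby('City', as_index=False)[['Start_Lat', 'Start_Lng']].mean()
print(round(city_loc.loc[city_loc['City'] == 'Miami', 'Start_Lat'].iloc[0], 2))  # 25.78
```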
# Set your Mapbox access token here
px.set_mapbox_access_token('pk.eyJ1IjoiYWppbmt5YWRlc2FpIiwiYSI6ImNscGd6OHJlbzAyOXoyanJ4a3E5eHM1Y3kifQ.ZzUvGP5rqkSuSmsfAYZ3HA')
# Create a scatter mapbox to visualize accidents by city
fig = px.scatter_mapbox(plotting_df,
lat='Start_Lat',
lon='Start_Lng',
size='Accident_Count',
color='Accident_Count',
color_continuous_scale=px.colors.sequential.Tealgrn,
size_max=20,
zoom=3,
hover_name='City',
# height=1200, # You may adjust this value as needed
# width=800,
title='Accidents in the United States by City',
mapbox_style='light')
# Show the plot
fig.show()
state_severity_counts = df.groupby(['State','Severity']).size().reset_index(name='Accident_Count')
# First, sort the DataFrame by 'State' (ascending) and 'Accident_Count' (descending)
state_severity_counts_sorted = state_severity_counts.sort_values(by=['State', 'Accident_Count'], ascending=[True, False])
# Now drop duplicate states, keeping the first occurrence (the severity with the highest count, thanks to the sorting)
highest_severity_per_state = state_severity_counts_sorted.drop_duplicates(subset='State')
highest_severity_per_state_sorted = highest_severity_per_state.sort_values(by='Accident_Count', ascending=False)
len(highest_severity_per_state_sorted)
# Print the resulting DataFrame
highest_severity_per_state_sorted.head(3)
49
| State | Severity | Accident_Count | |
|---|---|---|---|
| 13 | CA | 2 | 1419150 |
| 33 | FL | 2 | 737886 |
| 163 | TX | 2 | 445580 |
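The sort-then-`drop_duplicates` pattern above can also be written with `groupby(...).idxmax()`, which selects the row with the largest count per state directly. A toy sketch (hypothetical counts):

```python
import pandas as pd

counts = pd.DataFrame({'State': ['CA', 'CA', 'FL', 'FL'],
                       'Severity': [2, 3, 2, 4],
                       'Accident_Count': [100, 40, 80, 5]})

# For each state, keep the row whose Accident_Count is largest.
top = counts.loc[counts.groupby('State')['Accident_Count'].idxmax()]
print(top['Severity'].tolist())  # [2, 2]
```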
state_accident_counts = df.groupby(['State',]).size().reset_index(name='Accident_Count')
state_accident_counts_sorted = state_accident_counts.sort_values(by='Accident_Count', ascending=False)
state_accident_counts_sorted.head(3)
| State | Accident_Count | |
|---|---|---|
| 3 | CA | 1713431 |
| 8 | FL | 860790 |
| 41 | TX | 576923 |
fig = px.choropleth(state_accident_counts,
locations="State",
locationmode='USA-states',
color="Accident_Count",
hover_name="State",
scope='usa',
title='US Accidents Statewise Count',
width=800,
height=400,
color_continuous_scale=px.colors.sequential.Tealgrn,
labels={'Accident_Count': 'Number of Accidents'}
)
fig.show()
Most accidents with severity 2 are happening in populous states such as California, Texas, and Florida.
correlation_matrix = df.corr(numeric_only=True)  # numeric_only avoids errors on text columns in newer pandas
print(correlation_matrix)
Severity Start_Lat Start_Lng Distance Temperature \
Severity 1.00 0.07 0.05 0.04 -0.02
Start_Lat 0.07 1.00 -0.07 0.06 -0.44
Start_Lng 0.05 -0.07 1.00 0.01 -0.01
Distance 0.04 0.06 0.01 1.00 -0.05
Temperature -0.02 -0.44 -0.01 -0.05 1.00
Wind_Chill -0.06 -0.42 -0.03 -0.04 0.90
Humidity 0.02 0.02 0.18 0.01 -0.33
Pressure 0.04 -0.19 0.19 -0.09 0.11
Visibility -0.00 -0.09 -0.02 -0.04 0.21
Wind_Speed 0.04 0.03 0.07 0.01 0.03
Amenity -0.03 0.02 0.01 -0.03 0.01
Bump -0.01 -0.00 -0.01 -0.00 0.00
Crossing -0.11 -0.07 0.06 -0.09 0.06
Give_Way -0.00 0.00 0.02 -0.01 0.00
Junction 0.05 0.05 -0.04 0.03 -0.02
No_Exit -0.01 -0.01 0.01 -0.01 0.01
Railway -0.01 0.00 -0.02 -0.02 0.00
Roundabout -0.00 -0.00 0.00 -0.00 0.00
Station -0.05 -0.05 0.01 -0.04 0.03
Stop -0.05 -0.00 -0.03 -0.03 0.01
Traffic_Calming -0.01 -0.00 -0.00 -0.01 0.00
Traffic_Signal -0.11 -0.07 0.07 -0.11 0.05
Turning_Loop NaN NaN NaN NaN NaN
Wind_Chill Humidity Pressure Visibility Wind_Speed ... \
Severity -0.06 0.02 0.04 -0.00 0.04 ...
Start_Lat -0.42 0.02 -0.19 -0.09 0.03 ...
Start_Lng -0.03 0.18 0.19 -0.02 0.07 ...
Distance -0.04 0.01 -0.09 -0.04 0.01 ...
Temperature 0.90 -0.33 0.11 0.21 0.03 ...
Wind_Chill 1.00 -0.27 0.08 0.20 -0.04 ...
Humidity -0.27 1.00 0.11 -0.38 -0.17 ...
Pressure 0.08 0.11 1.00 0.04 -0.02 ...
Visibility 0.20 -0.38 0.04 1.00 0.01 ...
Wind_Speed -0.04 -0.17 -0.02 0.01 1.00 ...
Amenity 0.00 -0.01 0.02 0.01 0.00 ...
Bump 0.00 -0.01 -0.01 0.00 0.00 ...
Crossing 0.05 -0.03 0.03 0.04 0.03 ...
Give_Way -0.00 0.00 0.01 0.00 0.00 ...
Junction -0.03 -0.00 0.03 -0.01 0.01 ...
No_Exit 0.01 -0.01 -0.00 0.01 0.00 ...
Railway 0.00 -0.00 0.01 0.00 -0.00 ...
Roundabout 0.00 0.00 0.00 0.00 0.00 ...
Station 0.03 -0.01 0.04 0.02 0.01 ...
Stop 0.01 -0.02 -0.01 0.01 0.00 ...
Traffic_Calming 0.01 -0.00 0.00 0.00 0.00 ...
Traffic_Signal 0.02 -0.01 0.04 0.03 0.02 ...
Turning_Loop NaN NaN NaN NaN NaN ...
Give_Way Junction No_Exit Railway Roundabout Station \
Severity -0.00 0.05 -0.01 -0.01 -0.00 -0.05
Start_Lat 0.00 0.05 -0.01 0.00 -0.00 -0.05
Start_Lng 0.02 -0.04 0.01 -0.02 0.00 0.01
Distance -0.01 0.03 -0.01 -0.02 -0.00 -0.04
Temperature 0.00 -0.02 0.01 0.00 0.00 0.03
Wind_Chill -0.00 -0.03 0.01 0.00 0.00 0.03
Humidity 0.00 -0.00 -0.01 -0.00 0.00 -0.01
Pressure 0.01 0.03 -0.00 0.01 0.00 0.04
Visibility 0.00 -0.01 0.01 0.00 0.00 0.02
Wind_Speed 0.00 0.01 0.00 -0.00 0.00 0.01
Amenity 0.01 -0.03 0.01 0.05 0.00 0.15
Bump 0.00 -0.00 0.01 0.00 -0.00 0.00
Crossing 0.06 -0.09 0.06 0.18 0.00 0.17
Give_Way 1.00 -0.01 0.01 0.00 0.00 -0.00
Junction -0.01 1.00 -0.00 -0.01 0.01 -0.04
No_Exit 0.01 -0.00 1.00 0.00 -0.00 0.02
Railway 0.00 -0.01 0.00 1.00 -0.00 0.12
Roundabout 0.00 0.01 -0.00 -0.00 1.00 -0.00
Station -0.00 -0.04 0.02 0.12 -0.00 1.00
Stop 0.03 -0.04 0.03 0.01 0.00 0.04
Traffic_Calming 0.00 -0.00 0.01 0.00 0.00 0.01
Traffic_Signal 0.07 -0.10 0.03 0.06 -0.00 0.12
Turning_Loop NaN NaN NaN NaN NaN NaN
Stop Traffic_Calming Traffic_Signal Turning_Loop
Severity -0.05 -0.01 -0.11 NaN
Start_Lat -0.00 -0.00 -0.07 NaN
Start_Lng -0.03 -0.00 0.07 NaN
Distance -0.03 -0.01 -0.11 NaN
Temperature 0.01 0.00 0.05 NaN
Wind_Chill 0.01 0.01 0.02 NaN
Humidity -0.02 -0.00 -0.01 NaN
Pressure -0.01 0.00 0.04 NaN
Visibility 0.01 0.00 0.03 NaN
Wind_Speed 0.00 0.00 0.02 NaN
Amenity 0.03 0.02 0.11 NaN
Bump 0.02 0.68 -0.00 NaN
Crossing 0.12 0.04 0.48 NaN
Give_Way 0.03 0.00 0.07 NaN
Junction -0.04 -0.00 -0.10 NaN
No_Exit 0.03 0.01 0.03 NaN
Railway 0.01 0.00 0.06 NaN
Roundabout 0.00 0.00 -0.00 NaN
Station 0.04 0.01 0.12 NaN
Stop 1.00 0.03 -0.05 NaN
Traffic_Calming 0.03 1.00 0.01 NaN
Traffic_Signal -0.05 0.01 1.00 NaN
Turning_Loop NaN NaN NaN NaN
[23 rows x 23 columns]
# Plotting the heatmap
plt.figure(figsize=(15, 15)) # Making the plot larger
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
# Improving the aesthetics
plt.title('Correlation Matrix of Accident Data', fontsize=16)
plt.xticks(rotation=45, ha='right', fontsize=10) # Rotating x labels for better readability
plt.yticks(fontsize=10)
plt.tight_layout() # Adjusts the plot to ensure everything fits without overlapping
# Display the plot
plt.show()
[Heatmap: Correlation Matrix of Accident Data, covering all 23 numeric columns]
# Top 5 highest correlations
# Flatten the correlation matrix and sort values
corr_pairs = correlation_matrix.unstack().sort_values(ascending=False).drop_duplicates()
# Remove self-correlation pairs
corr_pairs = corr_pairs[corr_pairs.index.get_level_values(0) != corr_pairs.index.get_level_values(1)]
# Get the top 5 highest correlations
highest_corr_pairs = corr_pairs.head(5)
print("Top 5 highest correlations:")
print(highest_corr_pairs)
Top 5 highest correlations: Wind_Chill Temperature 0.90 Bump Traffic_Calming 0.68 Traffic_Signal Crossing 0.48 Visibility Temperature 0.21 Wind_Chill Visibility 0.20 dtype: float64
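One caveat with the approach above: `drop_duplicates` on the unstacked matrix removes any pairs that merely share a value, not just mirror pairs. Masking the strict upper triangle avoids this. A sketch on a toy correlation matrix:

```python
import numpy as np
import pandas as pd

# Toy symmetric correlation matrix with a deliberate tie (the two 0.2 values).
cm = pd.DataFrame([[1.0, 0.9, 0.2],
                   [0.9, 1.0, 0.2],
                   [0.2, 0.2, 1.0]],
                  index=list('abc'), columns=list('abc'))

# Keep only the strict upper triangle, so each pair appears exactly once
# and self-correlations are excluded, even when values tie.
mask = np.triu(np.ones(cm.shape, dtype=bool), k=1)
pairs = cm.where(mask).stack().sort_values(ascending=False)
print(pairs.tolist())  # [0.9, 0.2, 0.2] — the tie survives; drop_duplicates would lose one
```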
# Extracting the names of the columns involved in the highest correlations
highest_corr_columns = set()
for (col1, col2) in highest_corr_pairs.index:
highest_corr_columns.add(col1)
highest_corr_columns.add(col2)
# Creating a new DataFrame with the highest correlated columns
highest_corr_df = correlation_matrix.loc[highest_corr_columns, highest_corr_columns]
# Plotting a heatmap for the highest correlations
plt.figure(figsize=(10, 8))
sns.heatmap(highest_corr_df, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.xticks(rotation=45, ha='right', fontsize=10)
plt.title('Heatmap of Highest Correlations')
plt.show()
[Heatmap of Highest Correlations: Traffic_Calming, Visibility, Temperature, Crossing, Traffic_Signal, Bump, Wind_Chill]
Temperature and Wind Chill: there is a high positive correlation (0.90) between temperature and wind chill, which is expected since both reflect weather conditions. However, there is a counterintuitive finding on accidents vs. weather: most accidents happen at moderate temperatures around 50 °F (rather than in extreme weather), concentrated in a handful of cities and states.
Traffic Calming and Bump: traffic-calming measures have a strong positive correlation (0.68) with the presence of bumps. This indicates that bumps are a commonly used traffic-calming measure.
Bump Impact on Accidents
Bump_incident_counts = df.groupby('Bump').size().reset_index(name='Number_of_Incidents')
Bump_incident_counts
barplot=sns.barplot(x='Bump', y='Number_of_Incidents', data=Bump_incident_counts)
plt.title('Impact of Bumps on Number of Incidents')
plt.xlabel('Bump Presence')
plt.ylabel('Number of Incidents')
plt.yscale('log')
# Annotate each bar with the value
for index, row in Bump_incident_counts.iterrows():
barplot.text(index, row.Number_of_Incidents, row.Number_of_Incidents, color='black', ha="center")
plt.show()
# Traffic Calming and Bumps: Bumps are effectively used to calm traffic.
# The government should assess and possibly increase their use, especially near schools and residential areas.
|   | Bump | Number_of_Incidents |
|---|---|---|
| 0 | False | 7562619 |
| 1 | True | 3446 |
# Group by 'State' and 'Bump' and count the number of incidents
state_bump_counts = df.groupby(['State', 'Bump']).size().reset_index(name='Number_of_Incidents')
state_bump_counts.sort_values(by=['Number_of_Incidents'],ascending= False).head(10)
|   | State | Bump | Number_of_Incidents |
|---|---|---|---|
| 4 | CA | False | 1712434 |
| 13 | FL | False | 860487 |
| 67 | TX | False | 576676 |
| 62 | SC | False | 374995 |
| 51 | NY | False | 336935 |
| 40 | NC | False | 330052 |
| 71 | VA | False | 296286 |
| 59 | PA | False | 286779 |
| 35 | MN | False | 187497 |
| 57 | OR | False | 172814 |
# Group by 'State' and 'Bump' and count the number of incidents
state_bump_counts = df.groupby(['State', 'Bump']).size().reset_index(name='Number_of_Incidents')
# Sort the results by the number of incidents, not state, to get the top incidents
state_bump_counts_sorted = state_bump_counts.sort_values('Number_of_Incidents', ascending=False).head(10)
# Plotting the results
plt.figure(figsize=(14, 10))
barplot = sns.barplot(data=state_bump_counts_sorted, x='State', y='Number_of_Incidents', hue='Bump')
plt.xticks(rotation=90)
plt.title('Statewise Impact of Bumps on Number of Incidents')
plt.xlabel('State')
plt.ylabel('Number of Incidents')
# plt.legend(title='Bump Presence')
plt.legend(title='Bump Presence', loc='upper left', bbox_to_anchor=(0.87, 1))
plt.tight_layout(rect=[0, 0, 0.9, 1]) # Adjust the rect parameter to make space for the legend
# Annotate the top 3 states
top_states = state_bump_counts_sorted.head(3)
for i, (index, row) in enumerate(top_states.iterrows()):
# Get the x location of the bar
x = i
# Annotate with state name and incident count
plt.text(x, row['Number_of_Incidents'], f"{row['State']}: {int(row['Number_of_Incidents'])}", color='black', ha="left")
# plt.tight_layout()
plt.show()
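Because `Bump == True` rows are so rare, the sorted top-10 above contains only `False` rows. A `pd.crosstab` keeps both bump levels visible per state, with one row per state and one column per `Bump` value. A minimal sketch on a hypothetical mini-frame (`df_demo` is a stand-in, not the real dataset):

```python
import pandas as pd

# Tiny stand-in for the accidents frame (hypothetical rows).
df_demo = pd.DataFrame({
    'State': ['CA', 'CA', 'CA', 'FL', 'FL', 'TX'],
    'Bump':  [False, False, True, False, False, False],
})

# One row per state, one column per Bump value; zero-filled where a
# state has no rows for that level, so tiny True counts stay visible.
state_bump = pd.crosstab(df_demo['State'], df_demo['Bump'])
print(state_bump)
```

On the real frame, `pd.crosstab(df['State'], df['Bump'])` would give the same two-column layout for all states at once.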
Comparing the number of accidents by region, without and with speed bumps
# Filter the data to rows where 'Bump' is False; .copy() avoids a SettingWithCopyWarning when we add columns later
no_bump_data = df[df['Bump'] == False].copy()
# Define regions based on the median values of latitude and longitude
median_lat = no_bump_data['Start_Lat'].median()
median_lng = no_bump_data['Start_Lng'].median()
# Function to determine the region based on latitude and longitude
def determine_region(lat, lng, median_lat, median_lng):
if lat >= median_lat and lng >= median_lng:
return 'North-East'
elif lat < median_lat and lng >= median_lng:
return 'South-East'
elif lat >= median_lat and lng < median_lng:
return 'North-West'
else:
return 'South-West'
# Apply the function to create a new 'Region' column
no_bump_data['Region'] = no_bump_data.apply(lambda x: determine_region(x['Start_Lat'], x['Start_Lng'], median_lat, median_lng), axis=1)
# Use KMeans to cluster the data based on 'Latitude' and 'Longitude'
kmeans = KMeans(n_clusters=4) # We choose 4 to match the number of regions we've defined
no_bump_data['Cluster'] = kmeans.fit_predict(no_bump_data[['Start_Lat', 'Start_Lng']])
# Plotting
plt.figure(figsize=(10, 6))
# We use hue to color the data points based on the new 'Region' column
sns.scatterplot(data=no_bump_data, x='Start_Lng', y='Start_Lat', hue='Region', style='Cluster',
palette='Set1', alpha=0.6)
# Plot the cluster centers
centers = kmeans.cluster_centers_
# Count the number of accidents per cluster
cluster_accident_counts = no_bump_data.groupby('Cluster').size()
# Annotate the cluster centers with the accident counts
for i, count in enumerate(cluster_accident_counts):
plt.scatter(centers[i, 1], centers[i, 0], c='black', s=100, alpha=0.75, marker='X')
plt.text(centers[i, 1], centers[i, 0], str(count), color='black', fontsize=12, ha='left', va='bottom')
plt.title('Accidents Region-wise Clustering Without Bumps')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(title='Region', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
# The geometric center of each cluster, centriod
[Scatter plot: accidents region-wise clustering without bumps; cluster accident counts at the four centroids: 1,504,984 / 2,317,352 / 1,880,248 / 1,860,035]
The clustering shows that the vast majority of accidents occur in areas without speed bumps. The North-East region has a dense concentration of accidents, which likely corresponds to urbanized areas with heavy traffic.
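The choice of `n_clusters=4` above mirrors the four named regions. A common sanity check for that choice is the elbow method: fit KMeans for a range of k and watch where the inertia (within-cluster sum of squares) stops dropping sharply. A sketch on synthetic blob data — the coordinates below are hypothetical, not drawn from the dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic lat/lng-like points forming 4 loose blobs (hypothetical coordinates).
rng = np.random.default_rng(1)
blob_centers = np.array([[40, -75], [34, -118], [30, -95], [41, -88]])
pts = np.vstack([c + rng.normal(0, 0.5, (100, 2)) for c in blob_centers])

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)
    inertias[k] = km.inertia_
# Inertia always falls as k grows; the "elbow" (here near k=4) marks
# where extra clusters stop buying much improvement.
print(inertias)
```

For the real data, the same loop over `no_bump_data[['Start_Lat', 'Start_Lng']]` (ideally on a sample, given 7.5M rows) would indicate whether 4 clusters is a reasonable fit.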
# Filter the data to rows where 'Bump' is True (variable name kept from the previous cell); .copy() avoids a SettingWithCopyWarning
no_bump_data = df[df['Bump'] == True].copy()
# Define regions based on the median values of latitude and longitude
median_lat = no_bump_data['Start_Lat'].median()
median_lng = no_bump_data['Start_Lng'].median()
# Function to determine the region based on latitude and longitude
def determine_region(lat, lng, median_lat, median_lng):
if lat >= median_lat and lng >= median_lng:
return 'North-East'
elif lat < median_lat and lng >= median_lng:
return 'South-East'
elif lat >= median_lat and lng < median_lng:
return 'North-West'
else:
return 'South-West'
# Apply the function to create a new 'Region' column
no_bump_data['Region'] = no_bump_data.apply(lambda x: determine_region(x['Start_Lat'], x['Start_Lng'], median_lat, median_lng), axis=1)
# Use KMeans to cluster the data based on 'Latitude' and 'Longitude'
kmeans = KMeans(n_clusters=4) # We choose 4 to match the number of regions we've defined
no_bump_data['Cluster'] = kmeans.fit_predict(no_bump_data[['Start_Lat', 'Start_Lng']])
# Plotting
plt.figure(figsize=(10, 6))
# We use hue to color the data points based on the new 'Region' column
sns.scatterplot(data=no_bump_data, x='Start_Lng', y='Start_Lat', hue='Region', style='Cluster',
palette='Set1', alpha=0.6)
# Plot the cluster centers
centers = kmeans.cluster_centers_
# Count the number of accidents per cluster
cluster_accident_counts = no_bump_data.groupby('Cluster').size()
# Annotate the cluster centers with the accident counts
for i, count in enumerate(cluster_accident_counts):
plt.scatter(centers[i, 1], centers[i, 0], c='black', s=100, alpha=0.75, marker='X')
plt.text(centers[i, 1], centers[i, 0], str(count), color='black', fontsize=12, ha='left', va='bottom')
plt.title('Accidents Region-wise Clustering with Bumps')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(title='Region', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
[Scatter plot: accidents region-wise clustering with bumps; cluster accident counts at the four centroids: 849 / 859 / 1,310 / 428]
The clustering shows far fewer accidents in areas where speed bumps are present.
Summary of Finding: Regions with speed bumps tend to have lower accident counts, suggesting that bumps may contribute to road safety and accident prevention.
Validity of Finding: The clustering figures show a clear contrast: clusters without bumps record millions of accidents, while clusters with bumps record only a few thousand (note, however, that locations with bumps are also far rarer overall — 3,446 vs. 7,562,619 records).
Managerial Insights: Get more bumps! If the government increased the number of speed bumps in high-accident areas, accident rates could fall. Companies involved in road safety solutions could see increased demand for speed bumps and related traffic-calming products.
df_11 = df.copy()
# Example bin definitions
weather_bins = {
'Clear': ['Clear', 'Fair'],
'Cloudy': ['Cloudy', 'Mostly Cloudy', 'Partly Cloudy', 'Scattered Clouds', 'Overcast'],
'Rainy': ['Light Rain', 'Rain', 'Light Freezing Drizzle', 'Light Drizzle', 'Heavy Rain', 'Light Freezing Rain', 'Drizzle', 'Light Freezing Fog', 'Light Rain Showers', 'Showers in the Vicinity', 'T-Storm', 'Thunder', 'Patches of Fog', 'Heavy T-Storm', 'Heavy Thunderstorms and Rain', 'Funnel Cloud', 'Heavy T-Storm / Windy', 'Heavy Thunderstorms and Snow', 'Rain / Windy', 'Heavy Rain / Windy', 'Squalls', 'Heavy Ice Pellets', 'Thunder / Windy', 'Drizzle and Fog', 'T-Storm / Windy', 'Smoke / Windy', 'Haze / Windy', 'Light Drizzle / Windy', 'Widespread Dust / Windy', 'Wintry Mix', 'Wintry Mix / Windy', 'Light Snow with Thunder', 'Fog / Windy', 'Snow and Thunder', 'Sleet / Windy', 'Heavy Freezing Rain / Windy', 'Squalls / Windy', 'Light Rain Shower / Windy', 'Snow and Thunder / Windy', 'Light Sleet / Windy', 'Sand / Dust Whirlwinds', 'Mist / Windy', 'Drizzle / Windy', 'Duststorm', 'Sand / Dust Whirls Nearby', 'Thunder and Hail', 'Freezing Rain / Windy', 'Light Snow Shower / Windy', 'Partial Fog', 'Thunder / Wintry Mix / Windy', 'Patches of Fog / Windy', 'Rain and Sleet', 'Light Snow Grains', 'Partial Fog / Windy', 'Sand / Dust Whirlwinds / Windy', 'Heavy Snow with Thunder', 'Heavy Blowing Snow', 'Low Drifting Snow', 'Light Hail', 'Light Thunderstorm', 'Heavy Freezing Drizzle', 'Light Blowing Snow', 'Thunderstorms and Snow', 'Heavy Rain Showers', 'Rain Shower / Windy', 'Sleet and Thunder', 'Heavy Sleet and Thunder', 'Drifting Snow / Windy', 'Shallow Fog / Windy', 'Thunder and Hail / Windy', 'Heavy Sleet / Windy', 'Sand / Windy', 'Heavy Rain Shower / Windy', 'Blowing Snow Nearby', 'Blowing Sand', 'Heavy Rain Shower', 'Drifting Snow', 'Heavy Thunderstorms with Small Hail'],
'Harsh Conditions':['Light Snow', 'Snow', 'Light Snow / Windy', 'Snow Grains', 'Snow Showers', 'Snow / Windy', 'Light Snow and Sleet', 'Snow and Sleet', 'Light Snow and Sleet / Windy', 'Snow and Sleet / Windy','Blowing Dust / Windy', 'Fair / Windy', 'Mostly Cloudy / Windy', 'Light Rain / Windy', 'T-Storm / Windy', 'Blowing Snow / Windy', 'Freezing Rain / Windy', 'Light Snow and Sleet / Windy', 'Sleet and Thunder / Windy', 'Blowing Snow Nearby', 'Heavy Rain Shower / Windy','Hail','Volcanic Ash','Tornado']
#'Snowy': ['Light Snow', 'Snow', 'Light Snow / Windy', 'Snow Grains', 'Snow Showers', 'Snow / Windy', 'Light Snow and Sleet', 'Snow and Sleet', 'Light Snow and Sleet / Windy', 'Snow and Sleet / Windy'],
#'Windy': ['Blowing Dust / Windy', 'Fair / Windy', 'Mostly Cloudy / Windy', 'Light Rain / Windy', 'T-Storm / Windy', 'Blowing Snow / Windy', 'Freezing Rain / Windy', 'Light Snow and Sleet / Windy', 'Sleet and Thunder / Windy', 'Blowing Snow Nearby', 'Heavy Rain Shower / Windy'],
#'Hail': ['Hail'],
#'Volcanic Ash': ['Volcanic Ash'],
#'Tornado': ['Tornado']
}
def map_weather_to_bins(weather):
for bin_name, bin_values in weather_bins.items():
if weather in bin_values:
return bin_name
return 'Other'
df_11['Weather_Bin'] = df_11['Weather_Condition'].apply(map_weather_to_bins)
# Checking the unique values and their counts in the 'Severity' column
WeatherBin_counts = df_11['Weather_Bin'].value_counts()
WeatherBin_counts
Clear               3370412
Cloudy              3131679
Rainy                573883
Other                276572
Harsh Conditions     213519
Name: Weather_Bin, dtype: int64
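For a frame of this size, the per-row `apply` above can be slow; the same binning can be expressed with a vectorized `Series.map` over an inverted lookup dict built once. A minimal sketch — `weather_bins_demo` is a trimmed, hypothetical subset of the bins defined above:

```python
import pandas as pd

# Hypothetical subset of the bin definitions used above.
weather_bins_demo = {
    'Clear': ['Clear', 'Fair'],
    'Rainy': ['Light Rain', 'Rain'],
}
# Invert once to a condition -> bin lookup, then map in one vectorized pass.
condition_to_bin = {cond: bin_name
                    for bin_name, conds in weather_bins_demo.items()
                    for cond in conds}

s = pd.Series(['Fair', 'Rain', 'Tornado'])
binned = s.map(condition_to_bin).fillna('Other')
print(binned.tolist())  # → ['Clear', 'Rainy', 'Other']
```

On the full frame this would be `df_11['Weather_Condition'].map(condition_to_bin).fillna('Other')`, producing the same `Weather_Bin` column without a Python-level loop over 7.7M rows.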
# Filter the dataset for the specified weather conditions
df_clear = df_11[df_11['Weather_Bin'] == 'Clear']
df_cloudy = df_11[df_11['Weather_Bin'] == 'Cloudy']
df_harsh = df_11[df_11['Weather_Bin'] == 'Harsh Conditions']
df_rainy = df_11[df_11['Weather_Bin'] == 'Rainy']
df_other = df_11[df_11['Weather_Bin'] == 'Other']
len(df_clear)
len(df_cloudy)
len(df_harsh)
len(df_rainy)
len(df_other)
3370412
3131679
213519
573883
276572
# Calculate the mean values for visibility and severity for each weather condition
clear_visibility_mean = df_clear['Visibility'].mean()
clear_severity_mean = df_clear['Severity'].mean()
print("clear_visibility_mean:")
print(clear_visibility_mean)
print("clear_severity_mean:")
print(clear_severity_mean)
cloudy_visibility_mean = df_cloudy['Visibility'].mean()
cloudy_severity_mean = df_cloudy['Severity'].mean()
print("cloudy_visibility_mean:")
print(cloudy_visibility_mean)
print("cloudy_severity_mean:")
print(cloudy_severity_mean)
harsh_visibility_mean = df_harsh['Visibility'].mean()
harsh_severity_mean = df_harsh['Severity'].mean()
print("harsh_visibility_mean:")
print(harsh_visibility_mean)
print("harsh_severity_mean:")
print(harsh_severity_mean)
rainy_visibility_mean = df_rainy['Visibility'].mean()
rainy_severity_mean = df_rainy['Severity'].mean()
print("rainy_visibility_mean:")
print(rainy_visibility_mean)
print("rainy_severity_mean:")
print(rainy_severity_mean)
other_visibility_mean = df_other['Visibility'].mean()
other_severity_mean = df_other['Severity'].mean()
print("other_visibility_mean:")
print(other_visibility_mean)
print("other_severity_mean:")
print(other_severity_mean)
clear_visibility_mean:
9.852139471376198
clear_severity_mean:
2.187872283863219
cloudy_visibility_mean:
9.579545553679031
cloudy_severity_mean:
2.2383638297539434
harsh_visibility_mean:
4.803740276041009
harsh_severity_mean:
2.218214772455847
rainy_visibility_mean:
6.001854785731588
rainy_severity_mean:
2.2487859023529184
other_visibility_mean:
4.567289277294882
other_severity_mean:
2.1923079704380775
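The five near-identical mean computations above can be collapsed into a single `groupby`. A sketch on a hypothetical mini-frame (column names mirror `df_11` after binning; in the full notebook the visibility column may carry its `(mi)` unit suffix):

```python
import pandas as pd

# Hypothetical mini-frame standing in for df_11 after weather binning.
demo = pd.DataFrame({
    'Weather_Bin': ['Clear', 'Clear', 'Rainy', 'Rainy'],
    'Visibility':  [10.0, 9.0, 6.0, 5.0],
    'Severity':    [2, 3, 2, 2],
})

# One groupby yields every per-bin mean at once.
means = demo.groupby('Weather_Bin')[['Visibility', 'Severity']].mean()
print(means)
```

On the real frame, `df_11.groupby('Weather_Bin')[['Visibility', 'Severity']].mean()` would reproduce all ten printed means in one table.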
# Calculate the total records and severity distribution percentages for each weather condition
clear_total_records = len(df_clear)
clear_severity_1 = len(df_clear[df_clear['Severity'] == 1])
clear_severity_2 = len(df_clear[df_clear['Severity'] == 2])
clear_severity_3 = len(df_clear[df_clear['Severity'] == 3])
clear_severity_4 = len(df_clear[df_clear['Severity'] == 4])
clear_severity_1_pct = (clear_severity_1 / clear_total_records) * 100
clear_severity_2_pct = (clear_severity_2 / clear_total_records) * 100
clear_severity_3_pct = (clear_severity_3 / clear_total_records) * 100
clear_severity_4_pct = (clear_severity_4 / clear_total_records) * 100
harsh_total_records = len(df_harsh)
#harsh_severity_low = len(df_harsh[df_harsh['Severity'] <= 2])
#harsh_severity_high = len(df_harsh[df_harsh['Severity'] >= 3])
#harsh_severity_low_pct = (harsh_severity_low / harsh_total_records) * 100
#harsh_severity_high_pct = (harsh_severity_high / harsh_total_records) * 100
harsh_severity_1 = len(df_harsh[df_harsh['Severity'] == 1])
harsh_severity_2 = len(df_harsh[df_harsh['Severity'] == 2])
harsh_severity_3 = len(df_harsh[df_harsh['Severity'] == 3])
harsh_severity_4 = len(df_harsh[df_harsh['Severity'] == 4])
harsh_severity_1_pct = (harsh_severity_1 / harsh_total_records) * 100
harsh_severity_2_pct = (harsh_severity_2 / harsh_total_records) * 100
harsh_severity_3_pct = (harsh_severity_3 / harsh_total_records) * 100
harsh_severity_4_pct = (harsh_severity_4 / harsh_total_records) * 100
cloudy_total_records = len(df_cloudy)
#cloudy_severity_low = len(df_cloudy[df_cloudy['Severity'] <= 2])
#cloudy_severity_high = len(df_cloudy[df_cloudy['Severity'] >= 3])
#cloudy_severity_low_pct = (cloudy_severity_low / cloudy_total_records) * 100
#cloudy_severity_high_pct = (cloudy_severity_high / cloudy_total_records) * 100
cloudy_severity_1 = len(df_cloudy[df_cloudy['Severity'] == 1])
cloudy_severity_2 = len(df_cloudy[df_cloudy['Severity'] == 2])
cloudy_severity_3 = len(df_cloudy[df_cloudy['Severity'] == 3])
cloudy_severity_4 = len(df_cloudy[df_cloudy['Severity'] == 4])
cloudy_severity_1_pct = (cloudy_severity_1 / cloudy_total_records) * 100
cloudy_severity_2_pct = (cloudy_severity_2 / cloudy_total_records) * 100
cloudy_severity_3_pct = (cloudy_severity_3 / cloudy_total_records) * 100
cloudy_severity_4_pct = (cloudy_severity_4 / cloudy_total_records) * 100
rainy_total_records = len(df_rainy)
#rainy_severity_low = len(df_rainy[df_rainy['Severity'] <= 2])
#rainy_severity_high = len(df_rainy[df_rainy['Severity'] >= 3])
#rainy_severity_low_pct = (rainy_severity_low / rainy_total_records) * 100
#rainy_severity_high_pct = (rainy_severity_high / rainy_total_records) * 100
rainy_severity_1 = len(df_rainy[df_rainy['Severity'] == 1])
rainy_severity_2 = len(df_rainy[df_rainy['Severity'] == 2])
rainy_severity_3 = len(df_rainy[df_rainy['Severity'] == 3])
rainy_severity_4 = len(df_rainy[df_rainy['Severity'] == 4])
rainy_severity_1_pct = (rainy_severity_1 / rainy_total_records) * 100
rainy_severity_2_pct = (rainy_severity_2 / rainy_total_records) * 100
rainy_severity_3_pct = (rainy_severity_3 / rainy_total_records) * 100
rainy_severity_4_pct = (rainy_severity_4 / rainy_total_records) * 100
other_total_records = len(df_other)
#other_severity_low = len(df_other[df_other['Severity'] <= 2])
#other_severity_high = len(df_other[df_other['Severity'] >= 3])
#other_severity_low_pct = (other_severity_low / other_total_records) * 100
#other_severity_high_pct = (other_severity_high / other_total_records) * 100
other_severity_1 = len(df_other[df_other['Severity'] == 1])
other_severity_2 = len(df_other[df_other['Severity'] == 2])
other_severity_3 = len(df_other[df_other['Severity'] == 3])
other_severity_4 = len(df_other[df_other['Severity'] == 4])
other_severity_1_pct = (other_severity_1 / other_total_records) * 100
other_severity_2_pct = (other_severity_2 / other_total_records) * 100
other_severity_3_pct = (other_severity_3 / other_total_records) * 100
other_severity_4_pct = (other_severity_4 / other_total_records) * 100
# Display the calculated means and percentages
mean_values_and_percentages = {
'Weather': ['Clear', 'Harsh Conditions', 'Cloudy','Rainy','Other'],
'Visibility Mean': [clear_visibility_mean, harsh_visibility_mean, cloudy_visibility_mean, rainy_visibility_mean, other_visibility_mean],
'Severity Mean': [clear_severity_mean, harsh_severity_mean, cloudy_severity_mean, rainy_severity_mean, other_severity_mean],
'Total Records': [clear_total_records, harsh_total_records, cloudy_total_records, rainy_total_records, other_total_records],
'Severity 1': [clear_severity_1_pct, harsh_severity_1_pct, cloudy_severity_1_pct, rainy_severity_1_pct, other_severity_1_pct],
'Severity 2': [clear_severity_2_pct, harsh_severity_2_pct, cloudy_severity_2_pct, rainy_severity_2_pct, other_severity_2_pct],
'Severity 3': [clear_severity_3_pct, harsh_severity_3_pct, cloudy_severity_3_pct, rainy_severity_3_pct, other_severity_3_pct],
'Severity 4': [clear_severity_4_pct, harsh_severity_4_pct, cloudy_severity_4_pct, rainy_severity_4_pct, other_severity_4_pct]
}
# Convert the dictionary to a DataFrame for plotting
df_summary = pd.DataFrame(mean_values_and_percentages)
df_summary
|   | Weather | Visibility Mean | Severity Mean | Total Records | Severity 1 | Severity 2 | Severity 3 | Severity 4 |
|---|---|---|---|---|---|---|---|---|
| 0 | Clear | 9.85 | 2.19 | 3370412 | 1.01 | 81.77 | 14.66 | 2.57 |
| 1 | Harsh Conditions | 4.80 | 2.22 | 213519 | 0.31 | 81.00 | 15.23 | 3.45 |
| 2 | Cloudy | 9.58 | 2.24 | 3131679 | 0.80 | 77.19 | 19.38 | 2.63 |
| 3 | Rainy | 6.00 | 2.25 | 573883 | 0.65 | 76.46 | 20.25 | 2.64 |
| 4 | Other | 4.57 | 2.19 | 276572 | 0.55 | 82.02 | 15.07 | 2.36 |
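The severity-percentage columns above can also be produced in one step with a row-normalized crosstab, avoiding the twenty per-severity count variables. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical mini-frame: one weather bin and severity per accident.
demo = pd.DataFrame({
    'Weather_Bin': ['Clear'] * 4 + ['Rainy'] * 4,
    'Severity':    [2, 2, 3, 4, 2, 3, 3, 4],
})

# normalize='index' makes each row sum to 1; scale to percent of that bin.
pct = pd.crosstab(demo['Weather_Bin'], demo['Severity'], normalize='index') * 100
print(pct.round(2))
```

On the real frame, `pd.crosstab(df_11['Weather_Bin'], df_11['Severity'], normalize='index') * 100` would reproduce the Severity 1–4 percentage columns of the summary table directly.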
# Creating a heatmap for the calculated mean values and percentages
# We need to transpose the summary DataFrame to have Weather Conditions as columns for the heatmap
df_summary_transposed = df_summary.set_index('Weather').T
# Plotting the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df_summary_transposed, annot=True, fmt=".2f", cmap='coolwarm', linewidths=.3)
plt.title("Heatmap of Weather Conditions vs Visibility and Severity Metrics")
plt.ylabel("Metrics")
plt.show()
Key Insights
Managerial recommendation:
Impact on ADAS (Advanced Driver Assistance Systems):
In harsh conditions, where visibility is significantly reduced, ADAS features such as adaptive headlights, night vision systems, and forward-collision warnings should adapt to the lower-visibility regimes identified above, assisting in detecting obstacles on the road and helping prevent accidents.
sum_distance_severity_1 = df_11.loc[df_11['Severity'] == 1, 'Distance'].sum()
count_distance_severity_1 = df_11.loc[df_11['Severity'] == 1, 'Distance'].count()
print("Sum of Distance for Severity 1:", sum_distance_severity_1)
print("Count of Distance for Severity 1:", count_distance_severity_1)
Distance_Sev1 = sum_distance_severity_1/count_distance_severity_1
Distance_Sev1
Sum of Distance for Severity 1: 7228.2770003972655
Count of Distance for Severity 1: 65007
0.11119228699058971
sum_distance_severity_2 = df_11.loc[df_11['Severity'] == 2, 'Distance'].sum()
count_distance_severity_2 = df_11.loc[df_11['Severity'] == 2, 'Distance'].count()
print("Sum of Distance for Severity 2:", sum_distance_severity_2)
print("Count of Distance for Severity 2:", count_distance_severity_2)
Distance_Sev2 = sum_distance_severity_2/count_distance_severity_2
Distance_Sev2
Sum of Distance for Severity 2: 3357955.815013315
Count of Distance for Severity 2: 6011746
0.5585658168214883
sum_distance_severity_3 = df_11.loc[df_11['Severity'] == 3, 'Distance'].sum()
count_distance_severity_3 = df_11.loc[df_11['Severity'] == 3, 'Distance'].count()
print("Sum of Distance for Severity 3:", sum_distance_severity_3)
print("Count of Distance for Severity 3:", count_distance_severity_3)
Distance_Sev3 = sum_distance_severity_3/count_distance_severity_3
Distance_Sev3
Sum of Distance for Severity 3: 546066.5290523764
Count of Distance for Severity 3: 1291377
0.42285601265345163
sum_distance_severity_4 = df_11.loc[df_11['Severity'] == 4, 'Distance'].sum()
count_distance_severity_4 = df_11.loc[df_11['Severity'] == 4, 'Distance'].count()
print("Sum of Distance for Severity 4:", sum_distance_severity_4)
print("Count of Distance for Severity 4:", count_distance_severity_4)
Distance_Sev4 = sum_distance_severity_4/count_distance_severity_4
Distance_Sev4
Sum of Distance for Severity 4: 296317.4930206464
Count of Distance for Severity 4: 197935
1.4970444490395656
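The four sum/count cells above each compute a mean of `Distance` for one severity level, which a single `groupby` expresses directly. A sketch on a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame standing in for df_11[['Severity', 'Distance']].
demo = pd.DataFrame({
    'Severity': [1, 1, 2, 2, 4],
    'Distance': [0.1, 0.3, 0.5, 0.7, 1.5],
})

# One groupby replaces the four per-severity sum/count blocks.
mean_distance = demo.groupby('Severity')['Distance'].mean()
print(mean_distance)
```

On the real frame, `df_11.groupby('Severity')['Distance'].mean()` would return all four mean distances (0.11, 0.56, 0.42, 1.50) as one Series, ready to pass to the bar plot below.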
# Mean distance per severity level (computed in the cells above)
# paired with the corresponding severity values (1, 2, 3, 4)
Distance_Sev_Means = [Distance_Sev1, Distance_Sev2, Distance_Sev3, Distance_Sev4]
severity_values = [1, 2, 3, 4]
# Create a DataFrame from the means and severity values
df_means = pd.DataFrame({'Severity': severity_values, 'Mean_Distance': Distance_Sev_Means})
# Define a color palette for each severity level
severity_palette = {1: 'green', 2: 'brown', 3: 'orange', 4: 'red'}
# Plot the bar plot
plt.figure(figsize=(12, 8))
sns.barplot(x='Severity', y='Mean_Distance', data=df_means, palette=severity_palette, width=0.2, dodge=False)
plt.title('Bar Plot of Mean Distance against Severity')
plt.xlabel('Severity')
plt.ylabel('Mean Distance')
plt.show()
Key Insight: Severity 1 accidents have the shortest mean distance, while Severity 4 accidents impact significantly longer stretches of road. This could indicate that accidents with high severity tend to involve larger areas, potentially due to factors such as more vehicles being involved.
Interesting finding: Severity 2 accidents, while not the most impactful in terms of traffic delay, affect a longer stretch of road than Severity 3 accidents. This could suggest several scenarios:
Severity 2 accidents may involve incidents that, although not causing severe traffic delays, cover a larger area. This might be due to incidents that result in obstructions or hazards spread over a greater distance, causing moderate traffic slowdowns.
Severity 3 accidents, while affecting a smaller area, could be more concentrated and cause significant delays, possibly due to the road being blocked or more intensive emergency services response required.
1) a) Insight:
- Remarkably, almost 34% of accidents occurred even though a traffic signal, crossing, or junction was present at the location ((Traffic Signal) 14.84% + (Crossing) 11.35% + (Junction) 7.35% = 33.54%).
- On dividing the US states into 7 zones based on geographical conditions, the graph makes it evident that 68.47% of accidents in the South East region occurred even though these three factors (traffic signal, crossing, and junction) were present there.
South East: (Traffic Signal) 31.27% + (Crossing) 28.70% + (Junction) 8.50% = 68.47%
b) Managerial Recommendation:
- The government should levy heavier penalties for traffic-rule violations (especially in the South East region) so that drivers are more cautious and accident counts fall.
b) Managerial Insights:
- Get more bumps! If the government increased the number of speed bumps in high-accident areas, accident rates could fall. Companies involved in road safety solutions could see increased demand for speed bumps and related traffic-calming products.
a) Insight
b) Managerial Recommendation:
Impact on ADAS (Advanced Driver Assistance Systems): in harsh conditions, where visibility is significantly reduced, ADAS features such as adaptive headlights, night vision systems, and forward-collision warnings should adapt to these findings and assist in detecting obstacles on the road, helping prevent accidents.
(Note: ADAS is typically found in self-driving cars and modern vehicles.)